Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors
Authors
Abstract
We introduce a high performance, multi-threaded realization of the gemm kernel for the ARMv8.2 architecture that operates with 16-bit (half precision) floating point operands. Our code is especially designed for efficient machine learning inference (and, to a certain extent, also training) with deep neural networks. The results on the NVIDIA Carmel multicore processor, which implements the ARMv8.2 architecture, show considerable performance gains for the kernel, close to the theoretical peak acceleration that could be expected when moving from 32-bit arithmetic/data to 16-bit. Combined with the type of convolution operator arising in convolutional neural networks, the speed-ups are more modest though still relevant.
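The optimized kernel itself is not reproduced on this page, but the operation it accelerates is a standard GEMM on half-precision operands. As a point of reference, here is a minimal C sketch, assuming a compiler and target with _Float16 support (e.g., GCC or Clang on AArch64); the function name, column-major layout and loop order are illustrative choices, and the paper's actual code adds packing, cache blocking and an architecture-specific multi-threaded micro-kernel.

/* Reference half-precision GEMM: C := C + A*B, column-major.
 * A minimal sketch only, not the paper's optimized kernel. Assumes a
 * compiler/target with _Float16 support (e.g., GCC/Clang on AArch64);
 * all names here are illustrative. */
#include <stddef.h>

void hgemm_ref(size_t m, size_t n, size_t k,
               const _Float16 *A, size_t lda,
               const _Float16 *B, size_t ldb,
               _Float16 *C, size_t ldc)
{
    for (size_t j = 0; j < n; ++j)          /* for each column of C */
        for (size_t p = 0; p < k; ++p) {
            _Float16 bpj = B[p + j * ldb];
            /* axpy update: column j of C += A(:,p) * B(p,j), in FP16 */
            for (size_t i = 0; i < m; ++i)
                C[i + j * ldc] += A[i + p * lda] * bpj;
        }
}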
Similar resources
Sparse Matrix-Vector Multiplication on NVIDIA GPU
In this paper, we present our work on developing a new matrix format and a new sparse matrix-vector multiplication algorithm. The matrix format is HEC, which is a hybrid format. This matrix format is efficient for sparse matrix-vector multiplication and is friendly to preconditioners. Numerical experiments show that our sparse matrix-vector multiplication algorithm is efficient on...
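The exact HEC layout is not detailed in this snippet; as an illustration of the hybrid idea, the C sketch below combines a regular ELL part with a CSR part for overflow entries, a common hybrid construction. The names and the ELL/CSR split are assumptions, not necessarily the paper's format.

/* Hybrid SpMV y = A*x: a regular ELL part handles up to ell_width
 * entries per row, and a CSR part holds the overflow. ELL padding slots
 * must store value 0.0 with a valid column index (e.g., 0). */
#include <stddef.h>

void spmv_hybrid(size_t n, size_t ell_width,
                 const double *ell_val, const size_t *ell_col, /* n x ell_width, column-major */
                 const double *csr_val, const size_t *csr_col,
                 const size_t *csr_ptr,                         /* n+1 row offsets */
                 const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) {
        double s = 0.0;
        for (size_t j = 0; j < ell_width; ++j)      /* regular ELL part */
            s += ell_val[i + j * n] * x[ell_col[i + j * n]];
        for (size_t p = csr_ptr[i]; p < csr_ptr[i + 1]; ++p)
            s += csr_val[p] * x[csr_col[p]];        /* irregular CSR overflow */
        y[i] = s;
    }
}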
Low precision storage for deep learning
Multipliers are the most space- and power-hungry arithmetic operators in the digital implementation of deep neural networks. We train a set of state-of-the-art neural networks (Maxout networks) on three benchmark datasets: MNIST, CIFAR-10 and SVHN. They are trained with three distinct formats: floating point, fixed point and dynamic fixed point. For each of those datasets and for each of those f...
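As a rough illustration of the dynamic fixed point format that paper compares, the C sketch below quantizes a tensor to 16-bit integers with one shared scaling exponent chosen from the largest magnitude; the bit width, rounding and exponent rule are simplifying assumptions, not the authors' exact scheme.

/* Dynamic fixed point (sketch): n floats share one exponent e, so that
 * x ~= q * 2^e with q in [-32768, 32767]. Returns e. Link with -lm. */
#include <math.h>
#include <stdint.h>
#include <stddef.h>

int quantize_dfxp(const float *x, int16_t *q, size_t n)
{
    float amax = 0.0f;
    for (size_t i = 0; i < n; ++i)
        if (fabsf(x[i]) > amax) amax = fabsf(x[i]);
    /* smallest e such that amax / 2^e fits in 15 magnitude bits */
    int e = (amax > 0.0f) ? (int)ceilf(log2f(amax / 32767.0f)) : 0;
    float scale = ldexpf(1.0f, -e);          /* 2^-e */
    for (size_t i = 0; i < n; ++i) {
        float r = roundf(x[i] * scale);
        if (r >  32767.0f) r =  32767.0f;    /* saturate to int16_t range */
        if (r < -32768.0f) r = -32768.0f;
        q[i] = (int16_t)r;
    }
    return e;
}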
Strassen's matrix multiplication for customisable processors
Strassen's algorithm is an efficient method for multiplying large matrices. We explore various ways of mapping Strassen's algorithm into reconfigurable hardware that contains one or more customisable instruction processors. Our approach has been implemented using Nios processors with custom instructions and with custom-designed coprocessors, taking advantage of the additional logic and memory ...
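For reference, one level of Strassen's scheme is compact enough to state directly: seven products replace the eight of the classical 2x2 block multiplication. The C sketch below uses scalar entries for brevity; in an actual implementation the entries are sub-matrices and the seven products recurse.

/* One level of Strassen's scheme on a 2x2 (block) matrix, written with
 * scalar entries for brevity: 7 multiplications instead of 8. */
void strassen_2x2(const double A[2][2], const double B[2][2], double C[2][2])
{
    double m1 = (A[0][0] + A[1][1]) * (B[0][0] + B[1][1]);
    double m2 = (A[1][0] + A[1][1]) * B[0][0];
    double m3 = A[0][0] * (B[0][1] - B[1][1]);
    double m4 = A[1][1] * (B[1][0] - B[0][0]);
    double m5 = (A[0][0] + A[0][1]) * B[1][1];
    double m6 = (A[1][0] - A[0][0]) * (B[0][0] + B[0][1]);
    double m7 = (A[0][1] - A[1][1]) * (B[1][0] + B[1][1]);
    C[0][0] = m1 + m4 - m5 + m7;
    C[0][1] = m3 + m5;
    C[1][0] = m2 + m4;
    C[1][1] = m1 - m2 + m3 + m6;
}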
Implementing Blocked Sparse Matrix-Vector Multiplication on NVIDIA GPUs
We discuss implementing blocked sparse matrix-vector multiplication for NVIDIA GPUs. We outline an algorithm and various optimizations, and identify potential future improvements and challenging tasks. In comparison with a previously published implementation, our implementation is faster on matrices having many high fill-ratio blocks but slower on matrices with a low number of non-zero elements per...
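One common way to realize such a blocked layout is block CSR (BSR), where each stored entry is a small dense r x c block. The C sketch below shows a sequential BSR SpMV for illustration only; the authors' GPU kernel, block shapes and optimizations are not reproduced here.

/* SpMV y = A*x for a matrix in Block CSR (BSR) with r x c dense blocks. */
#include <stddef.h>

void spmv_bsr(size_t nblkrows, size_t r, size_t c,
              const size_t *row_ptr,   /* nblkrows+1 block-row offsets */
              const size_t *col_idx,   /* block column indices */
              const double *val,       /* blocks, row-major, r*c each */
              const double *x, double *y)
{
    for (size_t bi = 0; bi < nblkrows; ++bi) {
        for (size_t ii = 0; ii < r; ++ii) y[bi * r + ii] = 0.0;
        for (size_t p = row_ptr[bi]; p < row_ptr[bi + 1]; ++p) {
            const double *blk = val + p * r * c;   /* dense r x c block */
            const double *xs  = x + col_idx[p] * c;
            for (size_t ii = 0; ii < r; ++ii)
                for (size_t jj = 0; jj < c; ++jj)
                    y[bi * r + ii] += blk[ii * c + jj] * xs[jj];
        }
    }
}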
Matrix Multiplication on Three Heterogeneous Processors
We present a new algorithm specifically designed to perform matrix multiplication on three heterogeneous processors. This algorithm is an extension of the 'square-corner' algorithm designed for two-processor architectures [2]. For three processors, this algorithm partitions data in a way which, on a fully connected network, minimizes the total volume of communication (TVC) between the processors ...
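As a rough sketch of the geometry (an assumption based on the description above, not the paper's full algorithm): the two slower processors receive corner squares whose areas are proportional to their relative speeds, leaving the remainder of the N x N matrix to the fastest processor.

/* Square-corner partition sizes (illustrative assumption): corner
 * square areas proportional to speed fractions, side = N*sqrt(s/total).
 * Feasibility conditions from the paper are omitted. Link with -lm. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double N = 3000.0, s0 = 6.0, s1 = 2.0, s2 = 1.0;  /* example speeds */
    double total = s0 + s1 + s2;
    double side1 = N * sqrt(s1 / total);   /* corner square, processor 1 */
    double side2 = N * sqrt(s2 / total);   /* corner square, processor 2 */
    printf("corner squares: %.0f x %.0f and %.0f x %.0f (rest to fastest)\n",
           side1, side1, side2, side2);
    return 0;
}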
Journal
Journal title: The Journal of Supercomputing
Year: 2021
ISSN: 0920-8542, 1573-0484
DOI: https://doi.org/10.1007/s11227-021-03636-4